Introduction to R for Data Analysis

First Rules

  1. Don’t use base R code if you can help it. Use tidyverse instead. I’ve done both—so take my word for it!

  2. Don’t use base R. Use RStudio!!! I will hunt you down if you ever open R instead of RStudio.

  3. Don’t get misled by the myths out there about SAS vs R (e.g., R stuff is “home-cooked” 😉😉)! Here are a couple of links debunking some of those myths (while perpetuating at least one other myth—more on this later): An old blogpost from 2014, part 1 and Part 2 of the blogpost.

    1. At the end I’ll provide more links to resources demonstrating the versatility, professional quality, and reliability of R.
pacman::p_load(
  readxl,
  sas7bdat,
  tidyverse,
  tidyquant,
  tidymodels,
  scales,
  plotly,
  gt,
  gtsummary,
  
  usmap,
  timetk,
  leaflet,
  
  survival,
  survminer,
  nlme
  # lme4 # Not used here. But great for mixed models!
)
source("00_custom_functions.R")

Hands-on Basics

Much of the material presented here, and a lot more, can be found—in a very helpful way—at R for Data Science. This book is amazing! Ignore it at your own risk.

You can use R as a calculator:

3 + 5^3 - sin(4) / (pi*log10(100))
## [1] 128.1204

You can assign things to “objects” with the assignment arrow <-. The RStudio hot-key for it is alt+-, i.e., alt+dash:

X <- rnorm(100)
Y <- X + rnorm(100)

my_tibble <- tibble(
  x = X,
  y = Y
)

my_tibble

Note: If you want to generate, say, 100 random variates from the standard normal distribution, it’s as easy as rnorm(100). Lots of functionality for all the other common distributions is, naturally, available in R.
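
For example (base R, no packages needed): the common distributions all follow the same naming scheme, where the r-, d-, p-, and q-prefixed functions give random draws, densities, CDFs, and quantiles, respectively.

```r
u <- runif(5)            # 5 draws from Uniform(0, 1)
b <- rbinom(10, 1, 0.5)  # 10 Bernoulli(0.5) draws
p <- pnorm(0)            # P(Z <= 0) for a standard normal, i.e., 0.5
q <- qnorm(0.975)        # the familiar 1.96 critical value
```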

Here is X:

X
##   [1]  0.681712955  0.013504451  0.703414817 -1.343368874 -0.793282371
##   [6]  0.797989100  0.256266072 -0.228931486  0.845108020 -0.014505554
##  [11]  0.492617427  0.301186755  1.888087557  0.064143664  0.170707577
##  [16] -0.158929464 -0.788633307  0.486909846 -1.038239105 -2.175662833
##  [21] -0.018531016  1.933232775  0.004488554  0.769009211 -0.826111194
##  [26] -0.592886218 -1.578168730 -1.605119321  0.027811336 -0.761083859
##  [31] -0.156057469  0.962275170  0.859148686 -2.289144025  0.273567633
##  [36]  1.198585683 -2.528709949 -1.168268880 -1.212190654 -1.156183496
##  [41] -0.606609512  0.229673290  1.106938435  1.577256487  0.421750699
##  [46] -1.779598528  0.507373372 -1.666272014 -0.503166663 -0.677537383
##  [51]  1.078618808  0.491020720  1.644214916  0.229629310  1.711749109
##  [56]  0.247118861  0.576616988  0.082999165 -1.706847646  0.405091497
##  [61]  0.429761500 -0.345816971  0.480525755  0.262789109 -0.739894728
##  [66]  1.581302057  0.066614309  0.633735263  1.103032130 -1.898208296
##  [71]  0.087464081 -0.676715064  1.529062107 -0.774344610  0.088530551
##  [76] -0.508589677  0.578201211  0.792845481  0.393392602 -0.413063832
##  [81] -1.298529025 -1.441739666 -0.571473387 -0.645297746  0.653620533
##  [86]  0.136480709 -0.322389026  0.251299976  1.185847721  0.820856082
##  [91] -1.542853874  0.458678337  1.658766974  0.282623467  0.077315073
##  [96] -0.414286075  1.148338058 -0.134424408  2.480340832 -0.452971533
length(X)
## [1] 100

And here is Y:

Y
##   [1]  0.32897734 -1.17502403  2.85964549 -0.33428520  0.94278940  1.60083323
##   [7]  0.16073848 -1.20834325  0.85667030 -2.09890492  0.91706261 -0.48521211
##  [13]  0.84690276 -0.86507104  0.25651973 -2.30807864 -2.43950512  3.09613842
##  [19] -0.46584327 -0.43857338  1.57571068  2.57327979  1.02135065 -0.29001349
##  [25] -0.09653801 -0.59591307 -1.01318506 -2.93884977 -1.25075996 -0.50761196
##  [31]  0.43246855  1.70711827 -0.27067437 -2.30696225  0.43558663  1.21986036
##  [37] -2.52207857 -0.62272583 -1.01435578 -3.30504308 -0.83115265 -0.50516343
##  [43]  2.02839412  1.22377992 -1.28552915 -1.36856596 -0.08641231 -2.12056711
##  [49]  0.69902207  0.76878046  1.65386323 -0.23711295  0.39329561  0.65226856
##  [55]  2.40512600 -1.54124614 -0.45873223 -0.27321318 -1.88358798 -0.38848196
##  [61] -0.01980753  0.33786097 -0.02038072 -0.22888844 -1.79102999  0.40905058
##  [67] -0.10894170  0.62216238 -0.45502943 -0.17068127  1.46492281 -0.37001778
##  [73]  1.30559268 -1.94636728  0.99182919  0.43643385 -0.17380346  0.82774362
##  [79]  2.21261356 -1.04905564 -0.19652670 -1.80443074 -1.11070324 -1.76590919
##  [85]  0.61414345 -0.77340041 -0.15058227  1.04260950  1.27263345  1.40244118
##  [91] -1.45700215  0.63590712  0.83180966 -1.52844975 -0.05614839  0.96351309
##  [97]  1.90558674 -0.28655711  1.87158624  0.97079611

You can plot x and y:

my_tibble %>% 
  ggplot(aes(x,y)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  theme_tq() +
  labs(
    title = "This is a scatter plot with a linear OLS fitted regression line",
    x = "X",
    y = "Y"
  )

You can write your own functions (and your own R libraries, if you want!):

make_a_demo_data_tibble <- function(X){
  
  # Make y
  Y <- X + rnorm(length(X))
  
  # Return a tibble of x and y (the last expression is the return value)
  tibble(
    x = X,
    y = Y
  )
}

foo <- make_a_demo_data_tibble(X = rnorm(200))

foo

There are easy ways of making vectors or regular sequences of numbers:

seq(-4, 4, length.out = 100) %>% tibble(value = .)

The Pipe

What’s this %>% thing? It’s called the “pipe” operator. The hot-key for it is shift+ctrl+m, or shift+command+m if you’re on a mac. If you’re familiar with bash, it’s kind of like the pipe in bash code. It takes the output of the code before the pipe, and “pipes” it into the code after the pipe as the first argument in whatever function is sitting there. For example, this is what your code might look like without the pipe:
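
A minimal sketch of that idea, using base R’s native |> pipe (available since R 4.1) so that it needs no packages; %>% reads the same way:

```r
# Nested calls, read inside-out:
round(sqrt(sum(c(1, 4, 9))), 1)

# The same computation with the pipe, read left-to-right:
result <- c(1, 4, 9) |> sum() |> sqrt() |> round(1)
result   # both give 3.7
```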

Z <- X - 2*rnorm(100)
binary_variable <- rbinom(100,1,.5)

mutate(my_tibble,z = Z)       -> my_tibble
mutate(my_tibble,`x*z` = X*Z) -> my_tibble
mutate(
  my_tibble,
  b_var = binary_variable
)                             -> my_tibble

my_tibble

Notice all the redundant calls of my_tibble and the mutate function. But thanks to the pipe (and some other nice tidyverse features), we don’t ever have to code like this again!! Let’s see what this would be like with the pipe:

# Start over with a fresh my_tibble
make_a_demo_data_tibble(rnorm(100)) -> my_tibble

my_tibble
# Now use the pipe, and the fact that you can create multiple columns inside just one
# mutate function.
my_tibble %>% 
  rowwise() %>% 
  mutate(
    z     = x - 2*rnorm(1),
    `x*z` = x*z,
    b_var = rbinom(1,1,.5)
  )  %>% ungroup() -> my_tibble

my_tibble

If by chance you don’t see the intuitive nature of the pipe yet, find someone who can help you familiarize yourself with it through repeated practice, and very soon it will be as intuitive as putting one foot in front of the other!

Also, if you’re wondering what rowwise and ungroup are all about, don’t sweat it! They’re fairly self-explanatory. rowwise allows you to do data transformations or operations row-by-row. Since that “groups” the data set into single-row groupings, you simply ungroup after you’re done using rowwise. You’ll get more familiar with it after you try it out a few times.

You may have noticed the -> assignment arrow turned around in some of the code above. What’s with this right-facing assignment arrow? Just another cool illustration of the flexibility of R. You can point the assignment arrow either way, just as long as it’s facing the object name, not the “stuff” you’re assigning to the object.
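
A tiny base R illustration of the two arrows doing the same thing:

```r
x <- 5   # the usual left-facing assignment
5 -> y   # same assignment, arrow flipped to face the object name
```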

There are a bajillion other things to mention about first things in R Basics before moving on to the next R Basics topic of data wrangling, but hopefully this gives you a glimpse of what R is like.

Data Wrangling & Visualization

It’s been said that 80% of your quantitative analysis life is data wrangling. At least, it will be once you move outside of the classroom and into the real world.

This is just one more really big reason why you should use R.

Overview of Data Types

There are numerics, integers, character strings, factors, dates, date-times, and other data types. For example:

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
(1:10) %>% class()
## [1] "integer"
# Random uniform variates
runif(10)
##  [1] 0.66330441 0.25272306 0.44007174 0.09764087 0.47267968 0.30720518
##  [7] 0.89025356 0.60857034 0.75297315 0.51271489
runif(10) %>% class()
## [1] "numeric"
letters[1:3]
## [1] "a" "b" "c"
letters[1:3] %>% class()
## [1] "character"
letters[1:3] %>% factor()
## [1] a b c
## Levels: a b c
letters[1:3] %>% factor() %>% class()
## [1] "factor"
timetk::tk_make_timeseries("2021-10-01","2021-10-10")
##  [1] "2021-10-01" "2021-10-02" "2021-10-03" "2021-10-04" "2021-10-05"
##  [6] "2021-10-06" "2021-10-07" "2021-10-08" "2021-10-09" "2021-10-10"
timetk::tk_make_timeseries("2021-10-01","2021-10-10") %>% class()
## [1] "Date"

As with most of the stuff presented here, you can learn more here. But these data types are fairly self-explanatory.
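
A few base R conversions between these types, using the as.* family of functions:

```r
as.numeric("3.14")            # character -> numeric
as.integer(3.99)              # truncates toward zero, giving 3L
as.character(42)              # numeric -> character

f <- factor(c("a", "b", "a"))
levels(f)                     # the distinct levels: "a" "b"

d <- as.Date("2021-10-01")
class(d)                      # "Date"
```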

So let’s dive in and wrangle some crappy data!

BLS Employment Data

We’ll import data already placed in our project environment, but you can also find this on the web here, under May 2020; just click on XLS for “State” to download.

Notice that we have the help of the janitor package to “clean up” the names. We can also select which columns we want, slice the data to look at only the rows we want, and even tell R which Excel sheet we want to import, as well as how many lines at the top of the sheet to skip.

oes_tbl <- readxl::read_excel("00_data/state_M2020_dl.xlsx",sheet = 1) %>% 
  janitor::clean_names()

oes_dict <- readxl::read_excel("00_data/state_M2020_dl.xlsx",sheet = 2,skip = 8) %>%
  janitor::clean_names() %>% select(1:2) %>% slice(1:31)

Let’s say we’re interested in the percentile wage columns, for hourly wage. We can look at the columns that start with “h_” using the select and matches tidyverse functions:

oes_tbl %>% select(matches("h_"))

Notice that the data isn’t quite in the format we want! It’s in the character “chr” format. No problem. We just use the mutate function to convert it to numeric (what the tidyverse calls “double”, or “dbl”), along with the across and matches functions, to make sure we mutate across all the columns whose names match the pattern “h_”. We can also select those columns to view them more easily, just to make sure we did it right, before storing the updated tibble:

oes_tbl %>% mutate(across(matches("h_"),as.numeric)) %>% select(matches("h_"))

Note that NAs were introduced! That’s okay. We should expect that, since not everything in those columns was a number. R gives us a warning when it happens, though, just to make sure we’re aware of it.

(Don’t get warnings and errors confused! Warnings are not errors. Warnings do not mean the code failed to execute. Warnings just let you know something particular occurred that you might want to know about.)
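
Here’s that coercion behavior in miniature (base R; the "*" is a made-up stand-in for a non-numeric cell):

```r
x <- c("19.5", "22.1", "*")         # character column with one non-numeric cell
y <- suppressWarnings(as.numeric(x))
y                                   # 19.5 22.1 NA -- the "*" became NA
```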

Now that we can see our changes worked, let’s store that output (along with a selection of only the columns we want) as a new object—called oes_2_tbl.

oes_tbl %>% 
  mutate(across(matches("h_"),as.numeric)) %>% 
  select(area_title,prim_state,occ_title,matches("h_")) %>% 
  rename(state_name = area_title) -> oes_2_tbl

Notice I also used rename to rename the column “area_title” to “state_name”, just for the heck of showing you the rename function. We’ll use another column-renaming function below.

Just curious, do we have all 50 states represented in our data? One visual way to confirm that is by plotting them on a map!

usmap::plot_usmap(include = oes_2_tbl %>% 
                    distinct(prim_state) %>% 
                    slice(1:45) %>% 
                    pull(prim_state),
                  fill = "gray70") 

Notice that I “sliced” my data to only rows 1 through 45. I did that on purpose to show you a map with some states missing. We actually do have all the states though, as this map below confirms:

usmap::plot_usmap(include = oes_2_tbl %>% 
                    distinct(prim_state) %>% 
                    slice(1:51) %>% 
                    pull(prim_state),
                  fill = "gray70") 

You would probably want a more sure-fire way of checking something like completeness in a list of states. What would you do in order to get an exact list of only the states you’re missing?

(HINT: Try an anti-join using what you know to be a complete list of the states. Anti-joins are something we can cover later.)
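
For the curious, here’s the base R version of that idea using the built-in state.abb vector (dplyr’s anti_join does the analogous operation on tibbles):

```r
present <- state.abb[-c(2, 11)]         # pretend Alaska and Hawaii are missing
missing <- setdiff(state.abb, present)  # exactly the states we don't have
missing                                 # "AK" "HI"
```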

Now let’s say I want to plot the wages at the various percentiles listed for a dozen or so different occupations. Some of the occupation names are extremely, ridiculously long. What’s the most efficient, most time-productive way to do that?

Hmmm …

One idea that occurred to me is to take advantage of R’s great auto-complete feature! (Not featured in SAS 😦). You can try this out yourself in the auxiliary script for this BLS data exercise—bls_data_practice.R. In summary, I would do something like this:

oes_2_tbl %>% distinct(occ_title) %>% pull(occ_title) -> occupations

matrix(
  NA,
  nrow = 1,
  ncol = length(occupations)
) %>% 
  as.data.frame() %>% 
  as_tibble() %>% 
  set_names(occupations) -> occupations_tbl

Now I can simply type “occupations_tbl$” and hit tab. All the occupation titles display in the auto-complete pop-up! I can simply start typing anything that pops into my head and see if it’s there in this long list of 800+ different job titles!

Now I’ve got my own list of jobs I’m interested in researching, using copy and paste (came in handy for the really long names!):

selected_occupations <- c(
  "Dentists, General",
  "Physicians, All Other; and Ophthalmologists, Except Pediatric",
  "Physician Assistants",
  "Dental Hygienists",
  "Nurse Practitioners",
  "Management Occupations",
  "Sales Managers",
  "Chief Executives",
  "Legislators",
  "Epidemiologists",
  "Statisticians",
  "Software Developers and Software Quality Assurance Analysts and Testers"
)

And now I’m ready to make a really cool interactive plot.

oes_2_tbl %>% 
  filter(prim_state %in% c("OK","NY","TX","CA","UT","MA")) %>% 
  filter(occ_title %in% selected_occupations) %>% 
  filter(!(str_detect(occ_title,"Legisl"))) %>% 
  
  # Pivot longer
  pivot_longer(cols = h_mean:h_pct90) %>% 
  
  # Trim occupation title
  mutate(occ_title = str_sub(occ_title,end = 15)) %>% 
  
  # Plot
  ggplot(aes(name,value,color=occ_title,group=occ_title)) + 
  facet_wrap(~state_name) + 
  geom_point(size = 3) + 
  geom_line(size = 1.3) + 
  theme_tq() + scale_color_tq() + 
  theme(axis.text.x = element_text(angle = 30,hjust=.9,vjust=.9)) + 
  labs(title = "Hourly Wages by Occupation by State",
       x = "",y = "Dollars per Hour") -> g_wages

ggplotly(g_wages)

At this point you might be asking “What the heck is all that code supposed to mean?” Let’s break it down:

  1. First I filtered to just the states I was interested in. The %in% operator is similar to SAS’s similarly-named operator.
  2. Then I did another filter to select just the occupations in my selected_occupations character vector created in the previous code chunk.
  3. I then highlighted and ran (using command+enter, or ctrl+enter on Windows) just the code I had written by the end of step 2, and noticed that there is no data at all for legislator! So in step 3 I removed the rows that detected a “Legisl” pattern (when I code sometimes I’m too lazy to type out the entire word).
  4. Then I pivoted to a longer tibble, because R (specifically, ggplot2) can much more productively use long rather than wide data. Notice that if you type the name of one column, a colon, and then the name of a column somewhere to the right of that column (doesn’t matter how far away), R will select all the columns in between them, and including them, so you can save yourself the hassle of typing!
  5. Next I shortened the job titles to be 15 characters max.
  6. Then I plotted the data.
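
The wide-to-long idea in step 4 can be mimicked in base R with stack() (pivot_longer, used above, is far more flexible; this is just the concept in miniature):

```r
# Two hypothetical hourly-wage percentile columns, in wide format:
wide <- data.frame(h_pct25 = c(20, 31), h_pct75 = c(45, 60))

# Stack them into one "values" column plus an "ind" column of old names:
long <- stack(wide)
long
```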

Here’s the plot code breakdown:

  1. Inside ggplot, you use the aesthetics function called aes to plug in your x and y variables. Since pivot_longer defaults to calling the newly created key column “name” and the column with the corresponding values “value”, I just plugged in name for x, and value for y. Note: You can highlight and run the code only up to (and including) the pivot_longer line, and see for yourself what those columns look like! Try it in the auxiliary script! (i.e., bls_data_practice.R)
  2. facet_wrap creates cool little facets—one for each state.
  3. geom_point plots the data as points, and geom_line plots lines connecting the points that you tell R are supposed to be connected (note that I did this with the group argument inside the aesthetics).
  4. theme_tq and scale_color_tq simply modify the theme and the colors of the plot to ones that I think look great.
  5. All that stuff in the theme function just angles the x axis text, so as to make it legible.
  6. labs adds your plot labels, of course.

I used the right-facing assignment arrow to dump the output of all that code into an object I called “g_wages”. I use the letter “g” for “ggplot” plot objects.

Then I passed g_wages into the super awesome interactive plots function called ggplotly from the plotly library. And voila! Pretty cool. Try clicking on stuff once, and also twice, in the legend, and in the plot, to see what happens. You can look at any subset of jobs you like just by clicking on them! You can also hover over the plots to see exactly what the figures are for each job.

Next Steps

There are so many other things that you can easily do, and easily learn to do, to wrangle all the crappy things you’ll find in real world data. I can’t even begin to tell you how much easier your life will be if you use the tidyverse to wrangle and visualize data. See Joey Hadley’s Lynda Learning course on the tidyverse for more, or see if I or someone else here can help with a particular question. Don’t be afraid to ask! I’m happy to help.

Biostatistics Core Methods

This section very very briefly exhibits the tip of the iceberg of things you can do in the following biostats methods:

  • multiple linear regression (biostats II)

  • analysis of frequency data (e.g., logistic regression)

  • survival analysis (e.g., Cox PH regression)

  • longitudinal data analysis

  • infectious disease methods (creating and plotting adjustable SEIR models with R Shiny)

  • multivariate analysis (e.g., machine learning methods, using tidymodels)

I know from personal experience that you can do everything taught in each of those courses, plus other courses, in R.

There is a LOT more; this is just a brief intro.

Multiple Linear Regression

The burning question on everyone’s mind: Is highway miles per gallon associated with manufacturer, number of cylinders, type of drive train, and does it improve with time (comparing the year 1999 to 2008)?

Call the mpg dataset (from the tidyverse).

mpg

Pretty easy!

See this tidyverse help page for information about the meaning of the variable names.

Now store it as an object (not actually necessary).

mpg_tbl <- mpg

With minimal data preparation, we will be ready to fit a multivariable linear model.

Let’s make Ford the reference group for the manufacturer variable (for no particular reason other than that it’s a familiar manufacturer).

mpg_tbl %>% 
  mutate(
    year = factor(year),
    manufacturer = factor(manufacturer),
    manufacturer = relevel(manufacturer,ref = "ford")
  ) -> mpg_tbl

Now fit a linear model!

We’re just choosing a few of the variables available. We will not include interactions, for the sake of simplicity since this is just an intro to R.

fit_lm <- lm(
  hwy ~ drv + cyl + year + manufacturer,
  data = mpg_tbl
)

The syntax is nice. It’s kind of like writing a formula, inside the lm() function.
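
To get a feel for formula syntax, here’s a hedged sketch on the built-in mtcars data (the variables here are chosen only for illustration):

```r
fit1 <- lm(mpg ~ wt + factor(cyl), data = mtcars)  # additive terms
fit2 <- lm(mpg ~ wt * factor(cyl), data = mtcars)  # main effects + interaction
fit3 <- lm(mpg ~ wt - 1, data = mtcars)            # drop the intercept

names(coef(fit1))  # intercept, wt, and two dummy terms for the 3-level factor
```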

Now spit out a table for people to see the regression results.

fit_lm %>% tbl_regression()
Characteristic      Beta    95% CI        p-value
drv
  4 (reference)
  f                  6.0    4.8, 7.2      <0.001
  r                  4.6    3.0, 6.2      <0.001
cyl                 -1.5    -1.9, -1.2    <0.001
year
  1999 (reference)
  2008               1.1    0.35, 1.8     0.004
manufacturer
  ford (reference)
  audi               3.8    1.9, 5.7      <0.001
  chevrolet          0.60   -1.0, 2.2     0.5
  dodge             -1.4    -3.0, 0.25    0.10
  honda              4.4    2.0, 6.9      <0.001
  hyundai           -0.07   -2.2, 2.1     >0.9
  jeep               0.17   -2.1, 2.5     0.9
  land rover         0.46   -2.5, 3.4     0.8
  lincoln           -3.4    -6.8, -0.09   0.044
  mercury            0.42   -2.5, 3.4     0.8
  nissan             0.61   -1.5, 2.7     0.6
  pontiac            2.0    -0.84, 4.9    0.2
  subaru             3.3    1.0, 5.6      0.004
  toyota             1.2    -0.52, 2.9    0.2
  volkswagen         2.1    0.13, 4.0     0.037

CI = Confidence Interval

And spit out a dot-whisker plot for MUCH better visualization of the regression results.

fit_lm %>% tidied_model_lm() %>% 
  plot_dotwhisker(
    my_title = "Highway Miles Per Gallon Adjusted Linear Relationships"
  )

See why I tried to limit the number of variables in the model? Categorical variables are actually not just one variable!
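
You can see that expansion directly with base R’s model.matrix(), shown here on the built-in mtcars data: one 3-level factor becomes an intercept plus two dummy columns.

```r
# cyl has three levels: 4, 6, 8. The first level becomes the reference.
mm <- model.matrix(~ factor(cyl), data = mtcars)
colnames(mm)   # "(Intercept)" "factor(cyl)6" "factor(cyl)8"
```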

There is a great deal of flexibility and versatility with both the model prep, the model fitting, and the communication of results—this is just the tip of the iceberg! Just one example of what you can do.

For example, when it comes to communicating or displaying model results, you can opt for no labels on the dot-whisker plot:

fit_lm %>% tidied_model_lm() %>% 
  plot_dotwhisker(
    my_title = "Highway Miles Per Gallon Adjusted Linear Relationships",
    labels = FALSE
  )

Multiple Logistic Regression

Are nodule presence and nature, as well as prostate specific antigen (PSA) concentration, associated with capsule penetration of a prostate tumor?

A logistic regression model would be virtually the same, syntax-wise, except we use glm() instead of lm().
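
Here’s a minimal, self-contained sketch of that glm() syntax on the built-in mtcars data (am is already a 0/1 variable; the predictors are arbitrary, just for illustration):

```r
# Logistic regression: same formula syntax as lm(), plus a family argument
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

exp(coef(fit))   # exponentiate the log-odds coefficients to get odds ratios
```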

First, prepare the data to be properly analyzed. We’ll want to make capsule (the variable for capsule penetration yes/no, i.e., 1/0) a factor with the reference group set to the 0 (no penetration).

We’ll leave nodule nature as a numeric (i.e., an ordinal variable, with possibilities 1, 2, 3, and 4, representing no nodule, unilobar left, unilobar right, and bilobar, respectively).

prostate_tbl <- read_csv("00_data/Prostate_Frequency_Class.csv")

prostate_tbl %>% 
  mutate(
    capsule = factor(capsule),
    capsule = relevel(capsule,ref = "0")
  ) -> prostate_tbl

prostate_tbl

Now fit the model. The only new thing is that you specify a link function—in this case the logit link, which is the default when you set the family argument equal to binomial.

fit_glm <- glm(
  capsule ~ dpros + psa,
  data = prostate_tbl,
  family = binomial
)

Now spit out the results.

First the table of the results:

fit_glm %>% tbl_regression(exponentiate = TRUE)
Characteristic    OR     95% CI        p-value
dpros             1.90   1.50, 2.43    <0.001
psa               1.05   1.03, 1.07    <0.001

OR = Odds Ratio, CI = Confidence Interval

Then a dot whisker figure:

fit_glm %>% tidied_model_glm() %>% 
  plot_dotwhisker(
    my_title = "Adjusted Odds Ratios for the Model Covariates",
    x_axis   = "Odds Ratios (and 95% CIs)",
    x_intercept = 1
  )

Cool beans.

Survival Analysis (Cox PH Regression)

Question: Is overall survival for ovarian cancer patients associated with age, cancer stage (as a continuous variable), and cancer antigen 125?

The survival package is great for this. Other supporting packages include, primarily, survminer, such as for plotting data using the function ggsurvplot.
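
Before loading our own data, here’s a tiny self-contained sketch using the lung dataset that ships with the survival package (the covariates here are chosen arbitrarily):

```r
library(survival)  # a "recommended" package, distributed with R itself

# Cox proportional hazards fit on the built-in lung cancer data:
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

exp(coef(fit))   # hazard ratios for age and sex
```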

Load the data:

ovarian_tbl <- readxl::read_excel("00_data/17. CA125 in Ovary Cancer.xlsx") %>% 
  janitor::clean_names()

ovarian_tbl

Prep the data:

ovarian_tbl %>% 
  mutate(
    # Prep age_group
    age_group = recode(age_group,
                       `1` = "Up to 55 yo",
                       `2` = "Over 55 yo"),
    age_group = factor(age_group),
    age_group = relevel(age_group,ref = "Up to 55 yo"),
    
    # Prep ca125_level
    ca125_level = recode(ca125_level,
                         `1` = "Normal",
                         `2` = "High"),
    ca125_level = factor(ca125_level),
    ca125_level = relevel(ca125_level,ref = "Normal")
  ) -> ovarian_tbl

Fit the model:

fit_coxph <- coxph(
  Surv(os,status) ~ age_group + ca125_level + stage,
  data = ovarian_tbl
)

Share the results:

fit_coxph %>% tbl_regression(exponentiate = TRUE)
Characteristic    HR     95% CI        p-value
age_group
  Up to 55 yo (reference)
  Over 55 yo      3.04   2.09, 4.42    <0.001
ca125_level
  Normal (reference)
  High            4.98   2.69, 9.24    <0.001
stage             2.58   1.99, 3.34    <0.001

HR = Hazard Ratio, CI = Confidence Interval

fit_coxph %>% tidied_model_glm() %>% 
  plot_dotwhisker(
    my_title = "Adjusted Hazard Ratios for the Model Covariates",
    x_axis = "Hazard Ratios (and 95% CIs)",
    x_intercept = 1
  )

Longitudinal Data Analysis

Question: Is it lots of fun to fit a one-knot spline model, or just a little bit fun?

Load some longitudinal data about smoking and FEV, courtesy of our Applied Longitudinal Analysis textbook.

smoking_tbl <- sas7bdat::read.sas7bdat("00_data/smoking.sas7bdat") %>% 
    as_tibble() %>% 
    janitor::clean_names()

Some response profile analysis data transformation (don’t worry about this; with some time spent trying it out step by step on your own, it will hopefully make sense!):

smoking_tbl %>% 
    pivot_wider(names_from = time,values_from = fev1) %>% 
    mutate(baseline = `0`) %>% 
    pivot_longer(cols = `0`:`12`) %>% 
    rename(time = name,fev1 = value) %>% 
    mutate(
        change    = fev1 - baseline,
        time_rank = recode(time,
                           `0` = 1,
                           `3` = 2,
                           `6` = 3,
                           `9` = 4,
                           `12`= 5,
                           `15`= 6,
                           `19`= 7),
        time.f = factor(time,c(0,3,6,9,12,15,19)),
        time   = as.numeric(time),
        smoker = factor(smoker,c(0,1)), # The levels of smoker are 0 and 1
        id     = factor(id)
    ) %>% 
    arrange(id,time_rank) -> smoking_new_tbl

The knot:

smoking_new_tbl %>% 
    rowwise() %>% 
    mutate(
        # The knot:
        years_post_6 = (time - 6)*I(time >= 6) %>% as.numeric()
    ) %>% 
    ungroup() -> smoking_new_tbl
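
The (time − 6)+ hinge term is worth sanity-checking by hand. A base R version (the computation is vectorized, so rowwise isn’t strictly needed):

```r
# Hinge (one-knot spline) term: 0 before the knot, (t - knot) after it
hinge <- function(t, knot = 6) (t - knot) * as.numeric(t >= knot)

hinge(c(0, 3, 6, 9, 12))   # 0 0 0 3 6
```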

The model:

fit_F1_true <- gls(
  
    # The model formula
    fev1 ~  time + years_post_6 + smoker + smoker:time + smoker:years_post_6,
    
    corr    = corSymm(form = ~time_rank|id),    # Unstructured within-subject correlation 
    weights = varIdent(form = ~1|time_rank),    # Heterogeneous variance across time (one variance per time point)
    data    = smoking_new_tbl %>% drop_na(fev1) # Drop FEV NAs or your results will be wrong
)

By the way, you can write \(\LaTeX\) directly in RMarkdown! Just enclose in single dollar signs for in-line LaTeX, and double dollar signs (not visible here, but you will see them in the .Rmd file) for LaTeX “chunks”:

\[ Y_{ij} = \beta_1 + \beta_2Time_{ij} + \beta_3(Time_{ij} - 6)_+ + \beta_4Group_i + \beta_5Time_{ij}*Group_i + \beta_6(Time_{ij} - 6)_+*Group_i + e_{ij} \]

Here’s the one-knot spline model with estimated coefficients:

\[ \hat{Y}_{ij} = 3.55 - 0.043Time_{ij} + 0.013(Time_{ij} - 6)_+ - 0.33Group_i + 0.014Time_{ij}*Group_i - 0.025(Time_{ij} - 6)_+*Group_i \]

Pretty cool 😎.

Infectious Disease Methods

To check out my HIV model (and a generic SEIR) model, visit my Shiny app here.

Shiny is the way to easily make web apps just by writing R code! It opens up a whole new world for communicating your results to stakeholders. It allows them to interact with machine learning-based predictive models you would like to deploy, for example.

Multivariate Analysis - Machine Learning

I can’t begin to tell you how fun and exciting machine learning is in R with tidymodels. So I won’t try. Instead, I’ll simply refer you to Julia Silge’s AMAZING YouTube channel on tidymodels and machine learning, linked here.

For now, suffice it to say that if you are not doing machine learning using tidymodels, you are NOT doing machine learning the right way unless you’re living ten years ago!

Machine learning is a BIG topic, with lots of methods, and lots of great tools in the tidymodels ecosystem.

If you want to learn more, feel free to reach out to me; I’d be happy to help with what I can.

FYI, you can do everything in the BSE multivariate analysis course in R, including the stuff that we did in SAS. And everything you learned to do in that course, you can do more easily, painlessly, and richly if you use the tidyverse and tidymodels instead of base R.

Miscellaneous Cool Stuff

Maps

You can make interactive maps in R with leaflet, an R interface to the Leaflet JavaScript mapping library.

In this map below, we have a list of Oklahoma County high schools scraped from Wikipedia. They have been geocoded in R.

school_coords_tbl <- read_rds("00_data/OK_county_high_schools_coords.rds")

school_coords_tbl %>% my_jitter_coords(sd=.001) -> school_jitter_tbl
school_jitter_tbl %>% 
  leaflet() %>% 
  addProviderTiles("CartoDB.Voyager") %>% 
  setView(lng = -97.5,lat = 35.5, zoom = 10) %>% 
  addCircleMarkers(
    label = ~labels,
    lat = ~lat,
    lng = ~lng,
    radius = 7
  )

The code behind all of this (not too much code!) is in maps_and_web_scraping.R, also in this GitHub repo.

Dashboards

See flexdashboard_and_dygraphs.Rmd (and the corresponding html file).

Miscellaneous

There are SO many things we did not cover here, obviously. R is like a never-ending playground; only your research interests are the limit.

Here are a few random sources for more cool things to explore:

  • A gallery of extensions to ggplot2 for all kinds of data visualization, here. THIS GALLERY IS A MUST-SEE.

  • A flexdashboard gallery here. ALSO A MUST-SEE.

  • Don’t miss … The Big Book of R. It has every single online book about doing cool stuff in R!!

  • Joey Hadley’s intro to the tidyverse on Lynda Learning.

  • Side-by-side sample SAS code and R code for pretty much all the longitudinal data analysis exercises, provided by the authors of our BSE course’s textbook, here.

  • And don’t forget to check out stackoverflow.com where you will find answers to literally 90% of your questions about how to do anything R (as well as most other languages).

  • Lastly, check out my thesis! That way for sure more than two people will have read it. It was written and knit to PDF entirely in R. See here.